| Variable | Description |
|---|---|
| manufacturer | Car manufacturer |
| model | Car model |
| year | Year of manufacture |
| displ | Engine displacement (litres) |
| hwy | Miles per gallon (highway) |
| cty | Miles per gallon (city) |
| cyl | Number of cylinders |
| drv | Drive type (f = front, r = rear, 4 = 4wd) |
| class | Type of car |
| trans | Type of transmission |
| fl | Fuel type |
Data Visualization with R
MSDA - Bootcamp 2025 Summer
ggplot2
ggplot2 builds on the grammar of graphics (Wickham, 2016), which describes a plot in terms of the following components:
- data
- aesthetics
- geoms
- facets
- stats
- scales
- coordinates
- themes
- Aesthetics
- Specify how we want our data to map onto our plot
- Which variable belongs on the x-axis? What about the y-axis?
- Are we going to convey additional dimensions of data with colour, or shape, or opacity?
- Scale
- When setting scales, we need to allow for easy data visualisation
- Most of the time we’ll use a linear scale, but other options such as a geometric or logarithmic scale may suit data that are distributed differently
- Geoms
- Geoms are the actual visual elements that we use to represent our data
- Points, lines, bars, etc.
- Geoms are the building blocks of our plot
- Statistics
- We need to think about how to summarise our data
- Statistics summarise the data before it is drawn
- Facets
- Facets allow us to create multiple plots that each display a subset of the data
- This is useful for comparing different groups or categories within the data
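A minimal sketch of how these components can combine in a single plot, using the built-in mpg data. The specific choices here (a log-scaled x-axis, a linear fit, facets by year, theme_minimal()) are illustrative assumptions and not taken from the original slides.

```r
library(ggplot2)

# data: mpg; aesthetics: displ -> x, hwy -> y, drv -> colour
p_components <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +                            # geom: one point per car
  geom_smooth(method = "lm", se = FALSE) +  # stat: a linear fit per drive type
  scale_x_log10() +                         # scale: log-transformed x-axis
  facet_wrap(~ year) +                      # facets: one panel per model year
  theme_minimal()                           # theme: non-data appearance

p_components
```

Each `+` adds one component, so any of them can be dropped or swapped without touching the rest.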
ggplot2
- Every ggplot2 plot has three key components:
- data
- A set of aesthetic mappings between variables in the data and visual properties
- At least one layer which describes how to render each observation
- Layers are usually created with a geom function
ggplot2 - data illustration
- Use built-in dataset from ggplot2: mpg
- information about the fuel economy of popular car models in 1999 and 2008
- collected by the US Environmental Protection Agency
- here are some of the variables in the dataset:
+ manufacturer, model, year
+ displ (engine displacement in litres)
+ hwy (miles per gallon on the highway)
+ cty (miles per gallon in the city)
+ cyl (number of cylinders)
+ drv (f = front-wheel drive, r = rear-wheel drive, 4 = 4wd)
+ class (type of car)
+ trans (type of transmission)
+ fl (fuel type)
The mpg dataset is a tibble, a modern version of a data frame
The mpg dataset is part of the ggplot2 package
The mpg dataset is a tidy dataset
This dataset suggests many interesting questions
- How are engine size and fuel economy related?
- Do certain manufacturers care more about fuel economy than others?
- Has fuel economy improved in the last ten years?
List five functions that you could use to get more information about the mpg dataset
How can you find out what other datasets are included with ggplot2?
Apart from the US, most countries use fuel consumption (fuel consumed over fixed distance) rather than fuel economy (distance travelled with fixed amount of fuel). How could you convert cty and hwy into the European standard of l/100km?
Which manufacturer has the most models in this dataset?
- Which model has the most variations?
- Does your answer change if you remove the redundant specification of drive train (e.g. “pathfinder 4wd”, “a4 quattro”) from the model name?
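For the unit-conversion exercise above, one possible approach is sketched here. It assumes US gallons and statute miles; the helper name mpg_to_l100km and the new column names are hypothetical.

```r
library(ggplot2)  # provides the mpg dataset

# Assumption: 1 US gallon = 3.785411784 litres, 1 mile = 1.609344 km
litres_per_gallon <- 3.785411784
km_per_mile <- 1.609344

# litres per 100 km = 100 * litres_per_gallon / (km_per_mile * miles_per_gallon)
# (hypothetical helper, not part of ggplot2)
mpg_to_l100km <- function(miles_per_gallon) {
  100 * litres_per_gallon / (km_per_mile * miles_per_gallon)
}

mpg_metric <- transform(
  mpg,
  cty_l100km = mpg_to_l100km(cty),
  hwy_l100km = mpg_to_l100km(hwy)
)
head(mpg_metric[, c("cty", "cty_l100km", "hwy", "hwy_l100km")])
```

The conversion is a reciprocal, so higher miles per gallon means lower litres per 100 km; the constant works out to roughly 235.2 divided by the mpg value.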
ggplot2
- Let us plot the relationship between engine size and fuel economy
- How would you describe the relationship between displ and hwy?
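The scatterplot of engine size against highway fuel economy can be drawn with a sketch like the following. The object name p_scatter is only for illustration.

```r
library(ggplot2)

# data = mpg, aesthetics = displ on x and hwy on y, layer = points
p_scatter <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

p_scatter
```

These three pieces (data, aesthetic mapping, one geom layer) are the three key components listed earlier.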
ggplot2
Colour, size, shape and other aesthetic attributes
- Aesthetics are visual properties of the objects in the plot
- colour, size, shape, linetype, fill, alpha
- Aesthetics can be mapped to variables in the data
- aes(colour=variable)
- aes(size=variable)
- aes(shape=variable)
- aes(linetype=variable)
- aes(fill=variable)
- aes(alpha=variable)
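A short sketch of the difference between mapping an aesthetic to a variable (inside aes()) and setting it to a fixed value (outside aes()). The choice of class as the mapped variable and "blue" as the fixed colour are illustrative assumptions.

```r
library(ggplot2)

# Mapping: colour varies with the class variable and a legend is created
p_colour <- ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point()

# Setting: every point gets the same fixed colour, so no legend is needed
p_fixed <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(colour = "blue")

p_colour
p_fixed
```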
ggplot2
Colour, size, shape and other aesthetic attributes
ggplot2 takes care of the details of converting data (e.g., ‘f’, ‘r’, ‘4’) into aesthetics (e.g., ‘red’, ‘yellow’, ‘green’) with a scale
- There is one scale for each aesthetic mapping in a plot.
- The scale is also responsible for creating a guide, an axis or legend, that allows you to read the plot, converting aesthetic values back into data values
Scale functions for choosing the aesthetic values manually include:
- scale_colour_manual()
- scale_size_manual()
- scale_shape_manual()
- scale_linetype_manual()
- scale_fill_manual()
- scale_alpha_manual()
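As a sketch of a manual scale, the following maps the three drv codes to the example colours mentioned above. The legend labels and title are assumptions added for readability.

```r
library(ggplot2)

# Named vector: data value -> colour
drv_colours <- c("f" = "red", "r" = "yellow", "4" = "green")

p_manual <- ggplot(mpg, aes(displ, hwy, colour = drv)) +
  geom_point() +
  scale_colour_manual(
    values = drv_colours,
    labels = c("f" = "Front-wheel", "r" = "Rear-wheel", "4" = "Four-wheel"),
    name = "Drive type"  # legend title produced by the scale's guide
  )

p_manual
```

Naming the elements of `values` ties each colour to a data value explicitly, so the result does not depend on the order of the levels.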
What happens when you map the colour, size and shape aesthetics to continuous values?
What about categorical values?
What happens when you use more than one aesthetic in a plot?
ggplot2 — labels
- Labels are important for making your plot understandable
- xlab() and ylab() functions
- labs() function
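A minimal sketch of both labelling styles; the label text itself is an illustrative assumption.

```r
library(ggplot2)

base_plot <- ggplot(mpg, aes(displ, hwy)) +
  geom_point()

# xlab() and ylab() set one axis title each
p_axis_labels <- base_plot +
  xlab("Engine displacement (litres)") +
  ylab("Highway miles per gallon")

# labs() sets several labels in one call, including the title
p_labelled <- base_plot +
  labs(
    x = "Engine displacement (litres)",
    y = "Highway miles per gallon",
    title = "Larger engines tend to use more fuel"
  )

p_labelled
```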
ggplot2
ggthemes
Code
library(ggplot2)
library(ggthemes)  # provides theme_economist() and scale_color_tableau()
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  labs(
    x = "Engine size (litres)",
    y = "Highway fuel economy (miles per gallon)",
    title = "Relationship between engine size and fuel economy",
    color = "Car type",
    caption = "Source: mpg dataset"
  ) +
  theme_economist() +
  scale_color_tableau() +
  # add space between the axis titles and the axis text
  theme(
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
  )
ggplot2 - Facets
- Facets allow you to create multiple plots that each display a subset of the data
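- facet_wrap() wraps a sequence of panels, one per level of a variable, into rows and columns
- facet_grid() lays panels out in a matrix defined by a row variable and a column variable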
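A sketch of both faceting functions on the mpg data. The faceting variables (class, drv, cyl) are chosen only for illustration.

```r
library(ggplot2)

base_plot <- ggplot(mpg, aes(displ, hwy)) +
  geom_point()

# facet_wrap(): one panel per car class, wrapped into two rows
p_wrap <- base_plot + facet_wrap(~ class, nrow = 2)

# facet_grid(): rows by drive type, columns by number of cylinders
p_grid <- base_plot + facet_grid(drv ~ cyl)

p_wrap
p_grid
```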
ggplot2
Plot geoms
- Geoms are the geometric objects that represent the data in the plot
- geom_point() creates a scatterplot
- geom_smooth() fits a smoother to the data and draws the fitted line with its confidence band
- geom_histogram() creates a histogram
- geom_boxplot() creates a boxplot
- geom_bar() creates a bar plot
- geom_line() creates a line plot
- geom_vline() adds a vertical line to the plot
- geom_hline() adds a horizontal line to the plot
- geom_abline() adds a line with a given slope and intercept to the plot
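The three reference-line geoms can be sketched together as follows. Plotting cty against hwy and placing the lines at the sample means are assumptions made for the example.

```r
library(ggplot2)

p_refs <- ggplot(mpg, aes(cty, hwy)) +
  geom_point(alpha = 0.4) +
  # line where highway and city economy would be equal
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  # horizontal line at the mean highway value
  geom_hline(yintercept = mean(mpg$hwy), colour = "grey40") +
  # vertical line at the mean city value
  geom_vline(xintercept = mean(mpg$cty), colour = "grey40")

p_refs
```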
ggplot2
Adding a smoother to a plot
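A minimal sketch of adding a smoother to the engine size scatterplot. The second plot uses a linear model instead of the default smoother; both choices are illustrative.

```r
library(ggplot2)

# Default smoother with its confidence band
p_smooth <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth()

# Straight-line fit from a linear model
p_lm <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = "lm")

p_smooth
p_lm
```

Setting `se = FALSE` in geom_smooth() hides the confidence band.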
ggplot2
Boxplots
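A sketch of a boxplot comparing highway fuel economy across drive types. The choice of drv as the grouping variable is an assumption for the example.

```r
library(ggplot2)

# One box per drive type, summarising the distribution of hwy
p_box <- ggplot(mpg, aes(drv, hwy)) +
  geom_boxplot()

p_box
```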
ggplot2
Bar plots
- Bar plots are useful for visualizing the distribution of a categorical variable
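A short sketch of a bar plot of the class variable. geom_bar() counts the rows in each category itself, so no pre-summarised data is needed.

```r
library(ggplot2)

# Bar height = number of cars in each class
p_bar <- ggplot(mpg, aes(class)) +
  geom_bar()

p_bar
```

When the heights are already computed, as with the survival rates later in these slides, geom_col() is the usual choice.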
ggplot2
Histograms and density plots
- Histograms and density plots are useful for visualizing the distribution of a continuous variable
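A sketch of a histogram, a density plot, and a frequency polygon for hwy. The bin width of 2 and the split by drv are assumptions chosen for illustration.

```r
library(ggplot2)

# Histogram: counts of cars in bins of width 2 mpg
p_hist <- ggplot(mpg, aes(hwy)) +
  geom_histogram(binwidth = 2)

# Density plot: a smoothed estimate of the same distribution
p_dens <- ggplot(mpg, aes(hwy)) +
  geom_density()

# Frequency polygons make it easier to compare groups on one panel
p_freq <- ggplot(mpg, aes(hwy, colour = drv)) +
  geom_freqpoly(binwidth = 2)

p_hist
p_dens
p_freq
```

It is worth trying several bin widths, since the shape of a histogram can change noticeably with the binning.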
ggplot2
ggsave - save the graph as an image file
Code
# by default, ggsave() saves the most recently displayed plot
ggsave(filename = "mpg_displ.png", width = 6, height = 4)
Final Example - toy imports to the US from 1996-2005
- This example is drawn from Scott (2021)
Code
library(tidyverse)
toy_imports <- read_csv("https://raw.githubusercontent.com/kwan-MSDA/Bootcamp_2024/main/dataset/toyimports.csv")
head(toy_imports)
# A tibble: 6 × 8
  partner  year partner_name       product product_name US_report_import pop2000
  <chr>   <dbl> <chr>                <dbl> <chr>                   <dbl>   <dbl>
1 ARE      1998 United Arab Emira…  950341 "Toys repre…             1.06  3.25e6
2 ARE      2000 United Arab Emira…  950349 "Toys repre…            12.0   3.25e6
3 ARE      2003 United Arab Emira…  950349 "Toys repre…             4.65  3.25e6
4 ARE      2005 United Arab Emira…  950320 "Reduced-si…            49.2   3.25e6
5 ARG      1996 Argentina           950341 "Toys repre…             0     3.69e7
6 ARG      1996 Argentina           950310 "Electric t…            10.8   3.69e7
# ℹ 1 more variable: region <dbl>
- Task: make a graph showing total toy imports over time for the U.S.’s top 5 trading partners by total dollar value of toys imported
Final Example - toy imports to the US from 1996-2005
Code
country_total <- toy_imports %>%
  group_by(partner_name) %>%
  summarize(total_import = sum(US_report_import)) %>%
  arrange(desc(total_import)) %>%
  head(5)
country_total
# A tibble: 5 × 2
  partner_name     total_import
  <chr>                   <dbl>
1 China               26842305.
2 Denmark              1034990.
3 Canada                572309.
4 Hong Kong, China      545186.
5 Switzerland           400969.
- Each row gives the total dollar value of toys imported to the U.S. (US_report_import, in multiples of $1,000) in a specific product category from a specific country in a specific year
- The product categories have unique numerical codes (product) as well as product names exciting enough to quicken the heart of any toy-loving child (“Parts and accessories :– Other,” “Toys representing animal or non-human figures,” and so on)
- Group all the observations by trading partner (the partner_name variable)
- For each partner, calculate total dollar value by summing toy imports (US_report_import) across all categories and years
- Arrange the partners by total dollar value, largest first, and keep the top five
Final Example - toy imports to the US from 1996-2005
Code
#| out-width: 100%
top5_partners <- c("China", "Denmark", "Canada", "Hong Kong, China", "Switzerland")
options(scipen = 999)  # avoid scientific notation in printed numbers
library(ggthemes)
library(scales)
library(plotly)
p <- toy_imports %>%
  filter(partner_name %in% top5_partners) %>%
  group_by(year, partner_name) %>%
  summarize(total_import = sum(US_report_import)) %>%
  ggplot(aes(year, total_import, color = partner_name)) +
  geom_line() +
  labs(
    title = "Toy imports from the U.S.'s top-5 partners, 1996-2005",
    x = "Year",
    y = "Value of imports ($1,000s, log scale)",
    color = "Import Region"
  ) +
  scale_x_continuous(breaks = 1996:2005) +
  theme_economist() +
  # log10 y-axis with breaks and labels at powers of ten
  scale_y_log10(
    breaks = trans_breaks("log10", function(x) 10^x),
    labels = trans_format("log10", math_format(10^.x))
  )
ggplotly(p)  # convert the ggplot object to an interactive plotly graphic
The five coldest months in Rapid City from 1995 to 2011
Code
library(tidyverse)
rapidcity <- read_csv("https://raw.githubusercontent.com/kwan-MSDA/Bootcamp_2024/main/dataset/rapidcity.csv")
rapidcity %>%
  group_by(Year, Month) %>%
  summarize(
    avg_Temp = mean(Temp),
    lowest_temp = min(Temp),
    highest_temp = max(Temp)
  ) %>%
  arrange(avg_Temp) %>%
  head(5) %>%
  round(1)
# A tibble: 5 × 5
# Groups:   Year [4]
   Year Month avg_Temp lowest_temp highest_temp
  <dbl> <dbl>    <dbl>       <dbl>        <dbl>
1  1996     1     14.9       -11           46.1
2  2009    12     16.4        -2.6         35.6
3  2000    12     17.3        -9           38.8
4  1996    12     17.5       -10.8         40.4
5  2001     2     17.6        -3.9         40.8
- Import the data set (we’ve done this already).
- Split the data set into individual months in individual years: January 1995, February 1995, March 1995, and so on, all the way through December 2011.
- For each individual month, calculate the average of the Temp variable (along with any other summaries we might find interesting).
- Sort the individual months according to their average temperatures.
- Make a table of the five coldest months
Survival on the Titanic
Q: How did survival among adult passengers vary by sex and cabin class?
Code
titanic <- read_csv("https://raw.githubusercontent.com/kwan-MSDA/Bootcamp_2024/main/dataset/titanic.csv")
head(titanic)
# A tibble: 6 × 5
  name                            survived sex       age passengerClass
  <chr>                           <chr>    <chr>   <dbl> <chr>
1 Allen, Miss. Elisabeth Walton   yes      female 29     1st
2 Allison, Master. Hudson Trevor  yes      male    0.917 1st
3 Allison, Miss. Helen Loraine    no       female  2     1st
4 Allison, Mr. Hudson Joshua Crei no       male   30     1st
5 Allison, Mrs. Hudson J C (Bessi no       female 25     1st
6 Anderson, Mr. Harry             yes      male   48     1st
Code
surv_adults <- titanic %>%
  mutate(Adult = age >= 18) %>%
  filter(Adult) %>%
  group_by(sex, passengerClass) %>%
  summarize(
    total_count = n(),
    survived = sum(survived == "yes"),
    survival_rate = survived / total_count
  )
surv_adults
# A tibble: 6 × 5
# Groups:   sex [2]
  sex    passengerClass total_count survived survival_rate
  <chr>  <chr>                <int>    <int>         <dbl>
1 female 1st                    125      121        0.968
2 female 2nd                     85       74        0.871
3 female 3rd                    106       47        0.443
4 male   1st                    144       47        0.326
5 male   2nd                    143       12        0.0839
6 male   3rd                    289       45        0.156
Code
library(ggthemes)
ggplot(surv_adults) +
  geom_col(aes(x = sex, y = survival_rate)) +
  facet_wrap(~passengerClass, nrow = 1) +
  labs(
    title = "Survival rate by sex and passenger class",
    y = "Survival rate",
    x = "Sex"
  ) +
  theme_economist()
- create a new variable, which we’ll call Adult, that determines whether a passenger is at least 18 years old.
- filter the data set down to adults only.
- group the filtered data set by sex and cabin class (2 sexes × 3 classes = 6 groups).
- calculate the survival percentage for each group.
Extra: Gapminder data
Code
library(gapminder)
data(gapminder)
gapminder %>%
  group_by(year, continent) %>%
  # summarize() keeps one row per year and continent;
  # mutate() would repeat the median on every country row
  summarize(median_lifeExp = median(lifeExp)) %>%
  ggplot(aes(year, median_lifeExp, color = continent)) +
  geom_line() +
  labs(
    title = "Life expectancy by continent and year",
    x = "Year",
    y = "Life expectancy"
  ) +
  theme_economist()
Extra: Gapminder data
This version of the plot uses the BBC style
Code
# install.packages('devtools')
# devtools::install_github('bbc/bbplot')
library(ggpubr)
# the sourced script is expected to define bbc_style(), which is used below
source("https://raw.githubusercontent.com/kwan-MSDA/R/main/bbc_style.R")
gapminder %>%
  group_by(year, continent) %>%
  summarize(median_lifeExp = median(lifeExp)) %>%
  ggplot(aes(year, median_lifeExp, color = continent)) +
  geom_line() +
  labs(
    title = "Life expectancy by continent and year",
    x = "Year",
    y = "Life expectancy"
  ) +
  bbc_style()
Extra: Gapminder data
Code
library("ggalt")
library("tidyr")
library(gapminder)
# ten countries with the largest rise in life expectancy, 1967 to 2007
dumbbell_df <- gapminder %>%
  filter(year == 1967 | year == 2007) %>%
  select(country, year, lifeExp) %>%
  spread(year, lifeExp) %>%
  mutate(gap = `2007` - `1967`) %>%
  arrange(desc(gap)) %>%
  head(10)
# Make plot
ggplot(dumbbell_df, aes(x = `1967`, xend = `2007`, y = reorder(country, gap), group = country)) +
  geom_dumbbell(
    colour = "#dddddd",
    size = 3,
    colour_x = "#FAAB18",
    colour_xend = "#1380A1"
  ) +
  bbc_style() +
  labs(
    title = "We're living longer",
    subtitle = "Biggest life expectancy rise, 1967-2007"
  )
Extra: Gapminder data
Code
library(hrbrthemes)
library(viridis)
gapminder %>%
  filter(year == 2007) %>%
  mutate(country = factor(country, levels = unique(country))) %>%
  arrange(desc(pop)) %>%  # draw the largest bubbles first so small ones stay visible
  ggplot(aes(x = gdpPercap, y = lifeExp, size = pop, fill = continent)) +
  geom_point(alpha = 0.6, shape = 21, color = "black") +
  scale_size(range = c(.1, 24), name = "Population (M)") +
  # guide = "none" replaces the deprecated guide = FALSE
  scale_fill_viridis(discrete = TRUE, guide = "none", option = "A") +
  theme_ipsum() +
  theme(legend.position = "none") +
  labs(
    title = "Life expectancy by continent in 2007",
    x = "GDP per capita",
    y = "Life Expectancy"
  )
Extra: Gapminder data
Code
library(gganimate)
gapminder %>%
  # the frame aesthetic is kept from the original code; gganimate animates via transition_time()
  ggplot(aes(x = gdpPercap, y = lifeExp, size = pop, fill = continent, frame = year)) +
  geom_point(alpha = 0.6, shape = 21, color = "black") +
  scale_size(range = c(.1, 22), name = "Population (M)") +
  # guide = "none" replaces the deprecated guide = FALSE
  scale_fill_viridis(discrete = TRUE, guide = "none", option = "A") +
  theme_ipsum() +
  theme(legend.position = "none") +
  labs(
    title = "Life expectancy by continent in {frame_time}",
    x = "GDP per capita",
    y = "Life Expectancy"
  ) +
  # label only countries with more than 100 million people
  geom_text(
    data = gapminder %>% filter(pop > 1e+8),
    aes(label = country), size = 5, nudge_x = 0.1, nudge_y = 0.1
  ) +
  transition_time(year) +
  enter_fade() +
  exit_fade()